In this exercise, you will use the Diabetes dataset (built into scikit-learn) to explore how Ridge and Lasso regression handle correlated predictors and the effects of regularization. You will compare them to a standard Linear Regression model, visualize their coefficients, and interpret how regularization influences model complexity and performance.
You will also examine the role of interaction features, exploring how they can add flexibility to the model and help reduce bias.
We will conduct two sets of experiments:
- Using only the original features, and
- Using the original features plus interaction terms.
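As a preview of the second set of experiments, interaction terms are simply products of feature pairs appended as new columns. A minimal, self-contained NumPy sketch (the tiny matrix and feature names below are illustrative only, not the diabetes data):

```python
import numpy as np
from itertools import combinations

def add_interactions(X, names):
    """Append pairwise interaction columns (x_i * x_j) to a feature matrix."""
    cols = [X]
    new_names = list(names)
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
        new_names.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), new_names

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_int, names = add_interactions(X, ["a", "b", "c"])
print(X_int.shape)   # -> (2, 6): 3 original + 3 interaction columns
print(names)         # -> ['a', 'b', 'c', 'a*b', 'a*c', 'b*c']
```

scikit-learn's `PolynomialFeatures(interaction_only=True, include_bias=False)` produces the same columns and is what we could use on the diabetes features later.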
Learning Objectives¶
By the end of this exercise, you will be able to:
- Apply Ridge and Lasso regression to control model complexity and prevent overfitting.
- Use cross-validation to select the optimal level of regularization (alpha).
- Interpret how regularization affects coefficient magnitude, sparsity, and model interpretability.
- Evaluate how adding interaction features changes model bias, variance, and predictive performance.
📘 Step 1: Load & Understand the Diabetes Dataset¶
A crucial point to understand when working with the diabetes dataset from sklearn.datasets is that it is not raw data. The feature data (X) has already been pre-processed according to a specific normalization scheme.
This is noted by the original author on the data's source webpage (source):
> Note that the 10 x variables have been standardized to have mean 0 and squared length = 1 (sum(x^2)=1).
To make this clear for anyone using this notebook, we will load both the raw data from the source URL and compare it to the pre-processed data that comes with scikit-learn. This allows for a direct comparison and highlights what "pre-processing" has already been done.
👉 The Normalization Equation Used by the Dataset Authors:¶
The specific transformation applied to each of the 10 feature columns (e.g., AGE, BMI, etc.) is as follows:
$$ x_{\text{norm}_{ij}} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}} $$
Where:
- $x_{ij}$ is the original value of feature $j$ for sample $i$.
- $\bar{x}_j$ is the mean of all original values for feature $j$.
- $n$ is the total number of samples.
- The summation in the denominator is over all samples $k$ for that feature $j$.
This process ensures two properties for each resulting feature column:
- Mean of 0: The data is centered.
- Sum of Squares of 1: The vector for each feature has a unit L2-norm.
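Both properties are easy to verify numerically. A small sketch on a synthetic feature column (not the actual diabetes values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=442)                   # one synthetic feature column

centered = x - x.mean()                            # property 1: mean 0
x_norm = centered / np.sqrt((centered**2).sum())   # property 2: unit L2-norm

print(abs(x_norm.mean()) < 1e-12)            # True: mean is 0 (up to float error)
print(abs((x_norm**2).sum() - 1.0) < 1e-12)  # True: sum of squares is 1
```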
👉 Data Scaling Equation Used in StandardScaler¶
The scaling transformation that is used in StandardScaler can be expressed mathematically as:
$$ X_{\text{scaled}} = \frac{X_{\text{train}} - \mu_{\text{train}}}{\sigma_{\text{train}}} $$
Where:
- $X_{\text{train}}$ is the original feature value from the training data.
- $\mu_{\text{train}}$ is the mean of the feature values computed from the training data.
- $\sigma_{\text{train}}$ is the standard deviation of the feature values computed from the training data.
For the test data, the scaling transformation is applied as:
$$ X_{\text{test, scaled}} = \frac{X_{\text{test}} - \mu_{\text{train}}}{\sigma_{\text{train}}} $$
Where:
- $X_{\text{test}}$ is the original feature value from the test data.
- $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ are the mean and standard deviation computed from the training data, respectively.
Note: It is crucial to use the mean ($\mu_{\text{train}}$) and standard deviation ($\sigma_{\text{train}}$) from the training data for scaling the test data. This ensures that the test data is transformed in the same way as the training data, maintaining consistency and preventing data leakage.
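The train-only rule can be sketched directly with NumPy (synthetic arrays here; `StandardScaler` performs the equivalent computation internally):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(100, 15, size=(300, 2))   # synthetic "training" features
X_test = rng.normal(100, 15, size=(100, 2))    # synthetic "test" features

mu = X_train.mean(axis=0)      # statistics come from the TRAINING data only
sigma = X_train.std(axis=0)    # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma          # reuse mu/sigma -> no leakage

print(np.allclose(X_train_scaled.mean(axis=0), 0))  # True: train mean is 0
print(np.allclose(X_train_scaled.std(axis=0), 1))   # True: train std is 1
```

The scaled test data has mean close to, but not exactly, zero: its statistics were never used, which is exactly the point.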
To understand how the Diabetes dataset was preprocessed, we compare the raw data we downloaded with the version included in sklearn.datasets.load_diabetes. This comparison shows that scikit-learn's dataset has already been normalized according to the transformation described by the equation above.
Let’s start by loading the dataset from sklearn.datasets and from the URL and reviewing them.
import pandas as pd
# The URL of the raw data
url = 'https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt'
# Use pandas to read the tab-separated file directly from the URL
# sep='\t' tells pandas that the columns are separated by tabs
try:
    df_raw = pd.read_csv(url, sep='\t')
    # Now the data is in a pandas DataFrame called 'df_raw'
    print("Raw data loaded successfully!")
    display(df_raw.head())
    print("")
    print("The data shape is: ", df_raw.shape)

    # --- To save this data to a local file (e.g., a CSV) ---
    # This is useful if you want a local copy for offline use
    output_filename = 'diabetes_data.csv'
    df_raw.to_csv(output_filename, index=False)  # index=False prevents pandas from writing the DataFrame index as a column
    print(f"\nData saved locally to '{output_filename}'")
except Exception as e:
    print(f"An error occurred: {e}")
Raw data loaded successfully!
| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
| 3 | 24 | 1 | 25.3 | 84.0 | 198 | 131.4 | 40.0 | 5.0 | 4.8903 | 89 | 206 |
| 4 | 50 | 1 | 23.0 | 101.0 | 192 | 125.4 | 52.0 | 4.0 | 4.2905 | 80 | 135 |
The data shape is:  (442, 11)

Data saved locally to 'diabetes_data.csv'
Now let's load the data from sklearn.datasets, which is already normalized.¶
from sklearn.datasets import load_diabetes
import pandas as pd
# Load dataset
data = load_diabetes(as_frame=True)
X = data.data
y = data.target
print(f"Shape of the features dataframe : {X.shape}")
print("")
print("First 5 rows from the sklearn dataset [Already normalized]:")
X.head()
Shape of the features dataframe : (442, 10)

First 5 rows from the sklearn dataset [Already normalized]:
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
Let's get more information about the dataset.
print(data.DESCR)
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
Feature descriptions: Below are the feature descriptions for the diabetes dataset. Note that the data are already scaled.
feature_info = {
    "age": "Age (years)",
    "sex": "Sex (1 = male, 2 = female)",
    "bmi": "Body mass index (kg/m²)",
    "bp": "Average blood pressure",
    "s1": "TC – Total serum cholesterol",
    "s2": "LDL – Low-density lipoproteins ('bad' cholesterol)",
    "s3": "HDL – High-density lipoproteins ('good' cholesterol)",
    "s4": "TCH – Total cholesterol / HDL ratio",
    "s5": "LTG – Log of serum triglycerides level",
    "s6": "GLU – Blood sugar level (glucose)"
}
import pandas as pd
pd.DataFrame(feature_info.items(), columns=["Feature", "Description"])
| | Feature | Description |
|---|---|---|
| 0 | age | Age (years) |
| 1 | sex | Sex (1 = male, 2 = female) |
| 2 | bmi | Body mass index (kg/m²) |
| 3 | bp | Average blood pressure |
| 4 | s1 | TC – Total serum cholesterol |
| 5 | s2 | LDL – Low-density lipoproteins ('bad' cholesterol) |
| 6 | s3 | HDL – High-density lipoproteins ('good' cholesterol) |
| 7 | s4 | TCH – Total cholesterol / HDL ratio |
| 8 | s5 | LTG – Log of serum triglycerides level |
| 9 | s6 | GLU – Blood sugar level (glucose) |
Let's perform the following calculations:
- Normalize the raw data using the equation shown above. This should reproduce the values stored in sklearn.datasets for the diabetes dataset.
- Apply StandardScaler (from sklearn.preprocessing) to the raw data.
- Apply StandardScaler to the normalized data.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Exact feature list (order fixed)
FEATURES = ["AGE","SEX","BMI","BP","S1","S2","S3","S4","S5","S6"]
X = df_raw[FEATURES].astype(float) # force float and fixed order
# -------- A) Sklearn-style z-score (population std, ddof=0) --------
scaler = StandardScaler(with_mean=True, with_std=True) # default ddof=0
X_std_sklearn = pd.DataFrame(scaler.fit_transform(X), columns=[f"{c}_std" for c in FEATURES])
# -------- B) Diabetes normalization (mean 0, sum of squares = 1) --------
X_diab = pd.DataFrame(index=X.index)
for c in FEATURES:
    centered = X[c] - X[c].mean()
    denom = np.sqrt((centered**2).sum())
    X_diab[f"{c}_diab"] = centered / denom
# -------- C) Raw columns (for side-by-side) --------
X_raw = X.copy()
X_raw.columns = [f"{c}_raw" for c in FEATURES]
# -------- D) Combine in grouped order: raw, std (sklearn), diab (and optional sample-std) --------
cols_grouped = []
for c in FEATURES:
    cols_grouped += [f"{c}_raw", f"{c}_std", f"{c}_diab"]  # add f"{c}_std_sample" if you want to compare
combined = pd.concat([X_raw, X_std_sklearn, X_diab], axis=1)[cols_grouped]
# -------- E) Sanity checks (optional) --------
# 1) Sklearn z-scores should have ~0 mean, ~1 variance (population variance close to 1)
# 2) Diabetes normalization should have ~0 mean and sum of squares == 1 per column
# print(combined.head())
# print("Sklearn means ~0:\n", X_std_sklearn.mean().round(6))
# print("Sklearn var ~1 (np.var ddof=0):\n", np.var(X_std_sklearn.values, axis=0).round(6))
# print("Diabetes means ~0:\n", X_diab.mean().round(6))
# print("Diabetes sumsq = 1:\n", (X_diab**2).sum().round(6))
# If you only want to see the table:
#pd.set_option('display.max_columns', None)
#print(combined.head().round(6))
display("These are the first 5 rows of the raw diabetes data:", X_raw.head().round(6))
X_std_sklearn_normalize = pd.DataFrame(scaler.fit_transform(X_raw), columns=[f"{c}_std" for c in FEATURES])
print("These are the first 5 rows of the diabetes data after normalization [same as the tabulated data in sklearn.datasets]:")
display(X_diab.head().round(6))
print("This is the data after applying StandardScaler on the raw diabetes data:")
display(X_std_sklearn_normalize.head().round(6))
X_std_sklearn_normalize_scale = pd.DataFrame(scaler.fit_transform(X_diab), columns=[f"{c}_std" for c in FEATURES])
print("This is the data after applying StandardScaler on the normalized diabetes data \n[gives the same result as standardizing the raw data]:")
display(X_std_sklearn_normalize_scale.head().round(6))
'These are the first 5 rows of the raw diabetes data:'
| | AGE_raw | SEX_raw | BMI_raw | BP_raw | S1_raw | S2_raw | S3_raw | S4_raw | S5_raw | S6_raw |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 |
These are the first 5 rows of the diabetes data after normalization [same as the tabulated data in sklearn.datasets]:
| | AGE_diab | SEX_diab | BMI_diab | BP_diab | S1_diab | S2_diab | S3_diab | S4_diab | S5_diab | S6_diab |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
This is the data after applying StandardScaler on the raw diabetes data:
| | AGE_std | SEX_std | BMI_std | BP_std | S1_std | S2_std | S3_std | S4_std | S5_std | S6_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.800500 | 1.065488 | 1.297088 | 0.459841 | -0.929746 | -0.732065 | -0.912451 | -0.054499 | 0.418531 | -0.370989 |
| 1 | -0.039567 | -0.938537 | -1.082180 | -0.553505 | -0.177624 | -0.402886 | 1.564414 | -0.830301 | -1.436589 | -1.938479 |
| 2 | 1.793307 | 1.065488 | 0.934533 | -0.119214 | -0.958674 | -0.718897 | -0.680245 | -0.054499 | 0.060156 | -0.545154 |
| 3 | -1.872441 | -0.938537 | -0.243771 | -0.770650 | 0.256292 | 0.525397 | -0.757647 | 0.721302 | 0.476983 | -0.196823 |
| 4 | 0.113172 | -0.938537 | -0.764944 | 0.459841 | 0.082726 | 0.327890 | 0.171178 | -0.054499 | -0.672502 | -0.980568 |
This is the data after applying StandardScaler on the normalized diabetes data [gives the same result as standardizing the raw data]:
| | AGE_std | SEX_std | BMI_std | BP_std | S1_std | S2_std | S3_std | S4_std | S5_std | S6_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.800500 | 1.065488 | 1.297088 | 0.459841 | -0.929746 | -0.732065 | -0.912451 | -0.054499 | 0.418531 | -0.370989 |
| 1 | -0.039567 | -0.938537 | -1.082180 | -0.553505 | -0.177624 | -0.402886 | 1.564414 | -0.830301 | -1.436589 | -1.938479 |
| 2 | 1.793307 | 1.065488 | 0.934533 | -0.119214 | -0.958674 | -0.718897 | -0.680245 | -0.054499 | 0.060156 | -0.545154 |
| 3 | -1.872441 | -0.938537 | -0.243771 | -0.770650 | 0.256292 | 0.525397 | -0.757647 | 0.721302 | 0.476983 | -0.196823 |
| 4 | 0.113172 | -0.938537 | -0.764944 | 0.459841 | 0.082726 | 0.327890 | 0.171178 | -0.054499 | -0.672502 | -0.980568 |
Now let's print the statistical parameters for each set of data.¶
# Statistical parameters for raw data
means_raw = X_raw.mean()
stds_raw = X_raw.std(ddof=0) # population std
# Statistical parameters for diabetes-normalized data
means_diab = X_diab.mean()
stds_diab = X_diab.std(ddof=0) # population std
#statistical parameters for normalized and standardized data
means_std = X_std_sklearn_normalize_scale.mean()
stds_std = X_std_sklearn_normalize_scale.std(ddof=0) # population std
print("Means of raw data:\n", means_raw.round(6))
print("Standard deviations of raw data:\n", stds_raw.round(6))
print("\nMeans of diabetes-normalized data:\n", means_diab.round(6))
print("Standard deviations of diabetes-normalized data:\n", stds_diab.round(6))
print("\nMeans of normalized and standardized data:\n", means_std.round(6))
print("Standard deviations of normalized and standardized data:\n", stds_std.round(6))
# Save the normalized DataFrame 'X_diab' (created above) to a local CSV file

# Define the output filename
output_filename = 'x_diab_data.csv'

# Save the DataFrame to a CSV file without the index
X_diab.to_csv(output_filename, index=False)
#print(f"DataFrame successfully saved to '{output_filename}'")
Means of raw data:
 AGE_raw     48.518100
 SEX_raw      1.468326
 BMI_raw     26.375792
 BP_raw      94.647014
 S1_raw     189.140271
 S2_raw     115.439140
 S3_raw      49.788462
 S4_raw       4.070249
 S5_raw       4.641411
 S6_raw      91.260181
dtype: float64
Standard deviations of raw data:
 AGE_raw    13.094190
 SEX_raw     0.498996
 BMI_raw     4.413121
 BP_raw     13.815628
 S1_raw     34.568880
 S2_raw     30.378658
 S3_raw     12.919562
 S4_raw      1.288989
 S5_raw      0.521799
 S6_raw     11.483322
dtype: float64

Means of diabetes-normalized data:
 AGE_diab   -0.0
 SEX_diab    0.0
 BMI_diab    0.0
 BP_diab     0.0
 S1_diab    -0.0
 S2_diab    -0.0
 S3_diab    -0.0
 S4_diab    -0.0
 S5_diab     0.0
 S6_diab     0.0
dtype: float64
Standard deviations of diabetes-normalized data:
 AGE_diab    0.047565
 SEX_diab    0.047565
 BMI_diab    0.047565
 BP_diab     0.047565
 S1_diab     0.047565
 S2_diab     0.047565
 S3_diab     0.047565
 S4_diab     0.047565
 S5_diab     0.047565
 S6_diab     0.047565
dtype: float64

Means of normalized and standardized data:
 AGE_std    0.0
 SEX_std   -0.0
 BMI_std   -0.0
 BP_std     0.0
 S1_std    -0.0
 S2_std    -0.0
 S3_std     0.0
 S4_std     0.0
 S5_std    -0.0
 S6_std     0.0
dtype: float64
Standard deviations of normalized and standardized data:
 AGE_std    1.0
 SEX_std    1.0
 BMI_std    1.0
 BP_std     1.0
 S1_std    1.0
 S2_std    1.0
 S3_std    1.0
 S4_std    1.0
 S5_std    1.0
 S6_std    1.0
dtype: float64
Data Visualization¶
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Combine X and y into one dataframe
df_raw = X.copy()
df_raw['target'] = y
# Normalize column names to lowercase for consistency
df_raw.columns = [c.lower() for c in df_raw.columns]
# Define the most correlated features (based on your correlation matrix)
highlight_features = ['age', 'bmi', 'bp', 's1', 's2', 's5', 's6']
correlated_features = ['bmi', 'bp', 's5']
# Select variables for the pair plot (only ones that exist)
vars_for_plot = [c for c in highlight_features + ['target'] if c in df_raw.columns]
# Pair plot setup
pairplot = sns.pairplot(
    df_raw,
    vars=vars_for_plot,
    diag_kind='kde',
    height=2.3,
    plot_kws={'alpha': 0.6, 's': 20, 'edgecolor': 'k'}
)
# Restore shaded backgrounds for target vs key features
for i, ax_row in enumerate(pairplot.axes):
    for j, ax in enumerate(ax_row):
        if ax is not None:
            x_label = vars_for_plot[j]
            y_label = vars_for_plot[i]
            if (x_label == 'target' and y_label in correlated_features) or \
               (y_label == 'target' and x_label in correlated_features):
                ax.set_facecolor('lightyellow')  # highlight only target-feature comparisons
# Add overall title
plt.suptitle(
"Pair Plot of Features and Target\n(Shaded: Target vs. Most Related Features): Raw Data",
y=1.02,
fontsize=13
)
plt.show()
Interpretation of Pairplot¶
The pairplot provides a visual representation of the relationships between the selected features (age, bmi, bp, s1, s2, s5, s6) and the target variable. Key observations:
- bmi and bp show a relatively strong positive relationship with the target variable, indicating that higher values of these features are associated with higher target values.
- Some features, such as age, show weaker or less clear relationships with the target variable.
- The diagonal KDE plots show the distribution of each feature.
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df_raw.corr()
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap")
plt.show()
Interpretation of Heatmap¶
The heatmap displays the correlation matrix, showing the strength and direction of linear relationships between features and the target variable. Key observations:
- bmi has the highest positive correlation with the target variable (0.586), followed by s5 (0.566) and bp (0.441).
- Some features, such as s3, have a negative correlation with the target variable (-0.395).
- Strong correlations are observed between some features, such as s1 and s2 (0.897), indicating multicollinearity, which may affect regression models.
- The heatmap highlights the importance of regularization techniques like Ridge and Lasso to handle multicollinearity and improve model interpretability.
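To see why multicollinearity like the s1/s2 pair motivates regularization, here is a small self-contained sketch on synthetic data (not the diabetes features): with two nearly identical predictors, the ordinary least-squares normal equations are ill-conditioned and can split the weight erratically between the twins, while a ridge penalty stabilizes and shrinks the solution.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1 -> correlation ~1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)     # target driven by x1 only

# OLS: w = (X^T X)^{-1} X^T y  -- unstable when X^T X is near-singular
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + alpha * I)^{-1} X^T y  -- the penalty conditions the solve
alpha = 1.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print("OLS coefficients:  ", w_ols.round(3))    # weights can be large / opposite-signed
print("Ridge coefficients:", w_ridge.round(3))  # shrunk; total weight still ~1
print("Norms: OLS =", round(np.linalg.norm(w_ols), 3),
      " Ridge =", round(np.linalg.norm(w_ridge), 3))
```

The ridge solution never has a larger norm than the OLS solution, and the combined weight of the two twin features stays near the true value of 1.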
# Box plot for all features in the RAW data (excluding 'sex')
plt.figure(figsize=(12, 8))
for i, column in enumerate([col for col in df_raw.columns[:-1] if col != 'sex'], 1):  # exclude the target column and 'sex'
    plt.subplot(3, 4, i)  # adjust grid size based on the number of features
    sns.boxplot(y=df_raw[column], color="skyblue")
    plt.title(column)
plt.tight_layout()
plt.suptitle("Box Plots for All Features (Excluding 'Sex'): Raw data", y=1.02)
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot for all features (excluding 'SEX_std')
plt.figure(figsize=(12, 8))

# Loop through all columns except 'SEX_std'
for i, column in enumerate([col for col in X_std_sklearn_normalize_scale.columns if col != 'SEX_std'], 1):
    plt.subplot(3, 4, i)
    sns.boxplot(y=X_std_sklearn_normalize_scale[column], color="lightgreen")
    plt.title(column)

# Adjust layout AFTER all subplots are created
plt.tight_layout()
plt.suptitle("Box Plots for All Features (Excluding 'Sex'): Scaled Data", y=1.02)
plt.show()
#X_std_sklearn_normalize_scale.columns
Interpretation of Box Plot¶
The box plots provide insights into the distribution of each feature (excluding sex). Key observations:
- For numerical features, the box plots highlight the spread, central tendency (median), and potential outliers.
- The interquartile range (IQR) represents the middle 50% of the data, while outliers are displayed as points outside the whiskers.
- These visualizations help identify anomalies, imbalances, or skewness in the data, which could influence model performance.
Detailed Example: Box Plot of bmi¶
The box plot of bmi (Body Mass Index) provides a detailed view of its distribution:
- Median: The line inside the box represents the median BMI value, which is the central tendency of the data.
- Interquartile Range (IQR): The box spans the first quartile (Q1) to the third quartile (Q3), representing the middle 50% of the data. This range indicates the typical spread of BMI values in the dataset.
- Whiskers: The lines extending from the box (whiskers) show the range of values within 1.5 times the IQR. Values beyond this range are considered potential outliers.
- Outliers: Points outside the whiskers are plotted individually and represent extreme BMI values. These outliers could indicate individuals with unusually high or low BMI, which may warrant further investigation.
By examining the box plot of bmi, we can:
- Detect Skewness: If the median is closer to the bottom or top of the box, it suggests skewness in the data.
- Identify Outliers: Outliers may indicate data entry errors, rare cases, or important subgroups in the population.
- Compare Distributions: If comparing bmi across different groups (e.g., male vs. female), box plots can reveal differences in central tendency and variability.
Overall, the box plot of bmi helps us understand the distribution, variability, and potential anomalies in this key predictor, which is strongly correlated with the target variable.
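The whisker rule described above is simple to compute by hand. A NumPy sketch with synthetic values standing in for bmi (two extreme values are injected deliberately):

```python
import numpy as np

rng = np.random.default_rng(7)
bmi = rng.normal(26, 4, size=440)          # synthetic BMI-like values
bmi = np.append(bmi, [48.0, 49.5])         # two artificial extreme values

q1, q3 = np.percentile(bmi, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the whisker "fences"

outliers = bmi[(bmi < lower) | (bmi > upper)]
print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}, IQR = {iqr:.1f}")
print(f"Fences: [{lower:.1f}, {upper:.1f}]; outliers found: {len(outliers)}")
```

Any point outside the fences is exactly what the box plot draws as an individual marker beyond the whiskers.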
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

# Histogram for all features in raw data (excluding 'SEX_raw')
plt.figure(figsize=(12, 8))

# Loop through each feature except 'SEX_raw'
for i, column in enumerate([col for col in X_raw.columns if col != 'SEX_raw'], 1):
    plt.subplot(3, 4, i)  # adjust grid (3 rows x 4 columns)
    # Compute skewness
    skewness = skew(X_raw[column], nan_policy='omit')
    # Plot histogram
    sns.histplot(X_raw[column], kde=True, color="skyblue", bins=20)
    # Title with skewness info
    plt.title(f"{column}\nSkewness = {skewness:.2f}", fontsize=10)
    plt.xlabel("")  # hide x-label
    plt.ylabel("Frequency")
plt.tight_layout()
plt.suptitle("Histograms for All Features (Excluding 'Sex'): Raw Data", y=1.02, fontsize=14)
plt.show()
Interpretation Histogram plot¶
Skewness measures the asymmetry of a distribution around its mean.
It tells us whether the data are concentrated more on one side of the mean than the other.
Interpretation¶
| Type of Skew | Description | Shape |
|---|---|---|
| Skewness ≈ 0 | Nearly symmetric | Bell-shaped distribution |
| Positive skew ( > 0 ) | Long tail to the right; most values are lower than the mean | Right-skewed |
| Negative skew ( < 0 ) | Long tail to the left; most values are higher than the mean | Left-skewed |
Why it matters¶
- Features with strong positive or negative skewness may indicate that the data are not normally distributed.
- Such variables can affect model performance, especially for algorithms sensitive to distributional assumptions (e.g., linear regression).
- Applying transformations such as log, square-root, or Box-Cox can reduce skewness and help stabilize variance.
In the histograms above, each subplot displays the skewness value for the corresponding feature,
helping us quickly identify which variables deviate most from a symmetric distribution.
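The effect of a log transform on skewness can be checked numerically. A sketch using scipy on synthetic right-skewed (lognormal) data, not a diabetes feature:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)   # right-skewed by construction

print(f"Skewness before log transform: {skew(x):.2f}")   # clearly positive
print(f"Skewness after log transform:  {skew(np.log(x)):.2f}")  # near 0
```

Since the log of a lognormal variable is normal, the transform removes essentially all of the asymmetry here; on real features the reduction is usually partial.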
Section 1. Modeling with Original Features¶
Now we are ready to begin the first set of experiments using the original features (without any interaction terms).
Write the code for each empty section below. You may refer to examples from class or lab sessions as needed.
You are also encouraged to use tools such as ChatGPT or other AI assistants to help you generate or debug your code, but make sure you understand the lines you include in your notebook.
Step 2: Split Data into Training and Test Sets¶
Use test_size=0.2 and random_state=42.
We will use the data tabulated in the sklearn database for our analysis; in other words, from sklearn.datasets import load_diabetes.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

# Reload the normalized sklearn version of the data, since X was overwritten
# with the raw features in the comparison section above
data = load_diabetes(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shape of each new variable after splitting
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
X_train shape: (353, 10)
X_test shape: (89, 10)
y_train shape: (353,)
y_test shape: (89,)
Step 3: Scaling the Data¶
Scaling is performed to standardize the features so that they have a mean of 0 and a standard deviation of 1. This ensures that all features contribute equally to the model and prevents features with larger magnitudes from dominating the learning process.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
#Converting the scaled arrays back to DataFrames for better readability
df_train_scaled = pd.DataFrame(X_train_s, columns=X.columns)
df_test_scaled = pd.DataFrame(X_test_s, columns=X.columns)
# Display the first 5 rows of the scaled training and test data
print("Here are the first 5 rows of the scaled training data:")
display(df_train_scaled.head())
print("Here are the first 5 rows of the scaled test data:")
display(df_test_scaled.head())
Here are the first 5 rows of the scaled training data:
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.498365 | 1.061370 | 0.219902 | 1.138874 | 0.728473 | 1.055893 | -0.824451 | 0.711038 | 0.547482 | -0.061449 |
| 1 | -0.228858 | 1.061370 | -0.419366 | -0.710591 | -0.424929 | 0.272425 | -1.529791 | 1.484286 | -0.019757 | 0.367236 |
| 2 | 0.085182 | -0.942179 | 1.018987 | 1.992473 | -0.309589 | -0.326699 | -0.119111 | -0.062210 | 0.331237 | -0.318660 |
| 3 | -0.621409 | -0.942179 | -0.784662 | -0.639458 | -1.174640 | -1.215508 | 0.664600 | -0.835458 | -1.069682 | -2.719299 |
| 4 | -0.542899 | -0.942179 | -1.423930 | -1.706457 | -0.799784 | -1.110167 | 1.291569 | -1.608706 | -0.802859 | -0.918820 |
Here are the first 5 rows of scaled test data:
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.948794 | -0.942179 | -0.168225 | -0.354925 | 2.631586 | 2.649166 | 0.429487 | 0.711038 | 0.653601 | -0.147186 |
| 1 | 1.969426 | -0.942179 | 0.745015 | 0.427541 | -0.511434 | -0.333282 | 0.037631 | -0.835458 | -0.496909 | -0.490134 |
| 2 | 1.341345 | 1.061370 | -0.122563 | -0.283791 | 2.170225 | 1.042726 | 1.213198 | -0.062210 | 1.743607 | -0.404397 |
| 3 | 2.047936 | -0.942179 | 1.064649 | 1.613333 | 1.160999 | 0.785959 | -1.608162 | 2.953457 | 2.040014 | 1.224607 |
| 4 | 0.242203 | 1.061370 | -0.465028 | -0.070392 | 0.814978 | 1.134899 | -0.119111 | 0.711038 | -0.133128 | -0.232923 |
Step 4: Fit a Linear Regression Model¶
Create an object named lr to fit a linear regression model. Report the R² score on the test set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Create and fit a Linear Regression model
lr = LinearRegression().fit(X_train_s, y_train)
# Calculate the R² score on the test set
r2_lr = r2_score(y_test, lr.predict(X_test_s))
# Print the R² score to evaluate model performance
print(f"R² (Linear Regression): {r2_lr:.3f}")
R² (Linear Regression): 0.453
Step 5: Ridge Regression with Cross-Validation¶
Create an object named ridge to fit a 5-fold cross-validated Ridge regression model using alphas = [0.01, 0.1, 1, 10, 100]. Report the R² score on the test set, and the best alpha value.
Just a reminder of Ridge, Lasso loss functions and the k-fold CV (k-fold CV image source):

from sklearn.linear_model import RidgeCV
# Create a Ridge regression model with cross-validation
ridge = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
# Fit the model to the training data
ridge.fit(X_train_s, y_train)
# Calculate the R² score on the test set
r2_ridge = r2_score(y_test, ridge.predict(X_test_s))
# Optional: to see the R² scores for training and test sets
r2_ridge_train = r2_score(y_train, ridge.predict(X_train_s))
r2_ridge_test = r2_ridge
# Print the R² score and the best alpha value selected by cross-validation
print(f"R² (Ridge): {r2_ridge:.3f}, Best alpha: {ridge.alpha_}")
print("==================================================")
print("Optional info: ")
# To see the R² scores for training and test sets
print(f"R² (Ridge) on Training set: {r2_ridge_train:.3f}")
print(f"R² (Ridge) on Test set is {r2_ridge_test:.3f}, which is lower than the R² on the training set.")
R² (Ridge): 0.457, Best alpha: 10.0
==================================================
Optional info: 
R² (Ridge) on Training set: 0.525
R² (Ridge) on Test set is 0.457, which is lower than the R² on the training set.
# Plotting coefficient paths for ridge regression with CV
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import RidgeCV

# Ignoring warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

alphas = [0.01, 0.1, 1, 10, 100]
coefs = []
for a in alphas:
    ridge = RidgeCV(alphas=[a], cv=5)
    ridge.fit(X_train_s, y_train)
    coefs.append(ridge.coef_)
coefs = np.array(coefs)

plt.figure(figsize=(10, 6))
# Plot each feature's path with its name as the legend label,
# so the legend colors match the plotted lines
for i, feature in enumerate(X.columns):
    plt.plot(alphas, coefs[:, i], label=feature)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('RidgeCV Coefficients as a Function of Alpha')
plt.axis('tight')
plt.legend()
plt.show()
Step 6: Lasso Regression with Cross-Validation¶
Create an object named lasso to fit a 5-fold cross-validated Lasso regression model using alphas = [0.001, 0.01, 0.1, 1, 10] and max_iter=10000. Report the R² score on the test set, the best alpha value, and the number of non-zero coefficients.
from sklearn.linear_model import LassoCV
lasso = LassoCV(alphas=[0.001, 0.01, 0.1, 1, 10], cv=5, max_iter=10000)
lasso.fit(X_train_s, y_train)
r2_lasso = r2_score(y_test, lasso.predict(X_test_s))
# Optional: to see the R² scores for training and test sets
r2_lasso_train = r2_score(y_train, lasso.predict(X_train_s))
r2_lasso_test = r2_lasso
print(f"R² (Lasso): {r2_lasso:.3f}, Best alpha: {lasso.alpha_}")
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()} out of {len(lasso.coef_)}")
print("==================================================")
print("Optional info: ")
print(f"R² (Lasso) on Training set: {r2_lasso_train:.3f}")
print(f"R² (Lasso) on Test set is {r2_lasso_test:.3f}, lower than the R² on the training set.")
R² (Lasso): 0.467, Best alpha: 1.0
Non-zero coefficients: 9 out of 10
==================================================
Optional info: 
R² (Lasso) on Training set: 0.523
R² (Lasso) on Test set is 0.467, lower than the R² on the training set.
# Plotting coefficient paths for Lasso regression with CV over alphas = [0.001, 0.01, 0.1, 1, 10, 100]
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LassoCV
#Ignoring warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")
alphas=[0.001, 0.01, 0.1, 1, 10, 100]
coefs = []
for a in alphas:
    lasso = LassoCV(alphas=[a], cv=5, max_iter=10000)
    lasso.fit(X_train_s, y_train)
    coefs.append(lasso.coef_)
coefs = np.array(coefs)
plt.figure(figsize=(10, 6))
for i in range(coefs.shape[1]):
    plt.plot(alphas, coefs[:, i], label=X.columns[i])
plt.title("LassoCV Coefficients as a Function of Alpha")
plt.xlabel("Alpha")
plt.ylabel("Coefficient Value")
plt.xscale("log")
#showing best alpha line
plt.axvline(x=1, color='red', linestyle='--', label='Best alpha')
plt.text(1.2, plt.ylim()[0], ' alpha=1', color='red', rotation=90, verticalalignment='bottom')
plt.legend()
plt.show()
In the figure above, we can see that LassoCV favors a small alpha, keeping more features in the model to improve $R^2$.
Step 7: Lasso Regression with a Fixed Penalty¶
As discussed in class and during the lab session, the best alpha selected through cross-validation for Lasso is chosen to maximize predictive performance, NOT necessarily to improve model interpretability.
In this section, write code to fit a Lasso regression model with a larger, fixed value of alpha = 5 (without cross-validation). Name the object lasso_c; report the R² score on the test set and the number of non-zero coefficients.
from sklearn.linear_model import Lasso
lasso_c = Lasso(alpha=5)
lasso_c.fit(X_train_s, y_train)
r2_lasso_c = r2_score(y_test, lasso_c.predict(X_test_s))
print(f"R² (Lasso): {r2_lasso_c:.3f}")
print(f"Non-zero coefficients: {(lasso_c.coef_ != 0).sum()} out of {len(lasso_c.coef_)}")
R² (Lasso): 0.465
Non-zero coefficients: 5 out of 10
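The trade-off between alpha and sparsity can be checked directly by counting non-zero coefficients at several fixed penalties. The sketch below is self-contained and assumes its own split (`random_state=42`), so the exact counts may differ slightly from the notebook's:

```python
# How Lasso sparsity changes with alpha: fit at several fixed penalties
# and count the surviving (non-zero) coefficients.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
# Assumption: random_state=42; the notebook's split is not shown in this section
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_s = StandardScaler().fit_transform(X_train)

counts = []
for a in [0.1, 1, 5, 20]:
    coef = Lasso(alpha=a, max_iter=10000).fit(X_train_s, y_train).coef_
    counts.append(int((coef != 0).sum()))
    print(f"alpha={a:>4}: {counts[-1]} non-zero coefficients")
```

Larger penalties drive more coefficients exactly to zero, which is why alpha = 5 gives a sparser, more interpretable model than the CV-selected alpha = 1.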
We can plot the coefficient path for Ridge, RidgeCV, Lasso, and LassoCV.
# Showing the coefficient path for Lasso with a fixed alpha
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso
#Ignoring warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")
alphas = np.logspace(-2, 2, 100)
coefs = []
for a in alphas:
    lasso = Lasso(alpha=a, max_iter=10000)
    lasso.fit(X_train_s, y_train)
    coefs.append(lasso.coef_)
coefs = np.array(coefs)
plt.figure(figsize=(10, 6))
# Plot each feature's coefficient path with a matching legend entry
for i, feature in enumerate(X.columns):
    plt.plot(alphas, coefs[:, i], label=feature)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as a Function of Alpha')
plt.axis('tight')
# Showing a vertical line at alpha=5 with a rotated label at the lower part of the line
plt.axvline(x=5, color='red', linestyle='--', label='Alpha = 5')
plt.text(5.5, plt.ylim()[0], ' alpha=5', color='red', rotation=90, verticalalignment='bottom')
plt.legend()
plt.show()
We can see that at alpha = 5, only 5 features remain active: bmi, s5, bp, sex, and s3.
Step 8: Identifying the Important Features¶
Next we want to create a plot to help us visualize which predictors remain active in the model and whether they have a positive or negative association with the target.
Create a horizontal bar chart showing the non-zero Lasso coefficients for alpha = 5.
- Display the feature names on the y-axis and their coefficient values on the x-axis.
- Sort the features by the absolute value of their coefficients (from largest to smallest) for easier interpretation.
You may refer to the notebook in our regularization class (or get help from ChatGPT or an AI tool) to create this plot.
# Visualizing Non-Zero Lasso Coefficients (alpha = 5)
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame of feature names and their coefficients
coef_df = pd.DataFrame({
'Feature': X.columns,
'Coefficient': lasso_c.coef_
})
# Filter out zero coefficients
nonzero_df = coef_df[coef_df['Coefficient'] != 0]
# Sort by absolute value for clearer visualization
nonzero_df = nonzero_df.reindex(nonzero_df['Coefficient'].abs().sort_values(ascending=True).index)
#Display the non-zero coefficients DataFrame in a sorted manner from largest to smallest
print("Non-zero Lasso Coefficients (alpha = 5) sorted from largest to smallest:")
display(nonzero_df.sort_values(by='Coefficient', ascending=False))
# Plot
plt.figure(figsize=(8,5))
plt.barh(nonzero_df['Feature'], nonzero_df['Coefficient'], color='steelblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.title('Non-zero Lasso Coefficients (alpha = 5)')
plt.axvline(0, color='black', linewidth=1)
plt.tight_layout()
plt.show()
Non-zero Lasso Coefficients (alpha = 5) sorted from largest to smallest:
| | Feature | Coefficient |
|---|---|---|
| 2 | bmi | 25.850391 |
| 8 | s5 | 18.664676 |
| 3 | bp | 11.923355 |
| 1 | sex | -2.506567 |
| 6 | s3 | -8.262923 |
Step 9: Reflection on Experiment 1¶
Answer the following questions:

1. Which model gives the highest R² on the test set?

The cross-validated Lasso model gives the highest R² on the test set (0.467, versus 0.457 for Ridge), indicating that it balances bias and variance effectively while handling multicollinearity.

2. How do Ridge and Lasso coefficients compare to Linear Regression?

Ridge regression shrinks all coefficients towards zero but does not eliminate any, while Lasso regression sets some coefficients exactly to zero, effectively performing feature selection. Linear regression applies no regularization, which can lead to overfitting in the presence of multicollinearity.

3. How does increasing `alpha` affect model complexity for Lasso?

Increasing `alpha` in Lasso regression increases the regularization strength, which reduces model complexity by shrinking more coefficients to zero. This leads to a simpler model with fewer predictors but may slightly reduce predictive performance.

4. Interpret the top 5 factors identified by Lasso regression.

The top 5 factors identified by Lasso at alpha = 5 are `bmi`, `s5`, `bp`, `s3`, and `sex`. `bmi`, `s5`, and `bp` have positive coefficients, indicating a positive association with disease progression, while `s3` and `sex` have negative coefficients. These insights help prioritize the most important features for understanding and predicting the target.
Section 2. Modeling with Original and Interaction Features¶
In this section, you will expand the feature space by creating second-degree interaction features.
The general equation for a polynomial regression model with a single feature is:
$$ y = \beta_0 + \beta_1x + \beta_2x^2 + \dots + \beta_nx^n $$
For a model with multiple features, the equation also includes interaction terms. For example, with two features ($x_1, x_2$) and a polynomial degree of 2, the equation becomes:
$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1^2 + \beta_4x_2^2 + \beta_5x_1x_2$$
This is what PolynomialFeatures generates for you.
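The difference between a full degree-2 expansion and an interaction-only one can be seen on a tiny example. With two features, `interaction_only=True` keeps $x_1$, $x_2$, and $x_1 x_2$ but drops the squared terms $x_1^2$ and $x_2^2$:

```python
# Compare full degree-2 expansion vs. interaction-only on one sample [2, 3].
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])
full = PolynomialFeatures(degree=2, include_bias=False)
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

print(full.fit_transform(X_demo))   # [[2. 3. 4. 6. 9.]] -> x1, x2, x1^2, x1*x2, x2^2
print(inter.fit_transform(X_demo))  # [[2. 3. 6.]]       -> x1, x2, x1*x2
```

This is why the expansion in the next step adds only pairwise products to the 10 original features, not their squares.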
Step 10: Generate Polynomial Features¶
Use the PolynomialFeatures class from sklearn.preprocessing with the argument interaction_only=True to generate only interaction terms (no squared terms).
Name the new variables X_train_poly and X_test_poly.
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = pf.fit_transform(X_train)
X_test_poly = pf.transform(X_test)
print(f"Feature space expanded from {X_train.shape[1]} to {X_train_poly.shape[1]} features.")
# Showing some of the new feature names: DataFrame columns are the expanded features (first 20 rows shown)
feature_names = pf.get_feature_names_out(X.columns)
df_poly_features = pd.DataFrame(X_train_poly, columns=feature_names)
print("Here are some of the new polynomial features created (first 20 rows):")
display(df_poly_features.head(20))
Feature space expanded from 10 to 55 features.
Here are some of the new polynomial features created (first 20 rows):
(Output truncated: a 20 × 55 DataFrame showing the 10 original features plus their pairwise interaction terms, e.g. age sex, bmi bp, s5 s6.)
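The reported column count can be sanity-checked by hand: with `interaction_only=True` and `degree=2`, the output is the 10 original features plus one product per unordered pair, i.e. $10 + \binom{10}{2} = 55$.

```python
# Expected number of columns from interaction-only degree-2 expansion:
# n originals + C(n, 2) pairwise products.
from math import comb

n = 10
n_expanded = n + comb(n, 2)
print(n_expanded)  # 55
```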
Step 11: Standardize the New Expanded Feature Space¶
Name the new variables X_train_poly_s and X_test_poly_s.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_poly_s = scaler.fit_transform(X_train_poly)
X_test_poly_s = scaler.transform(X_test_poly)
#showing the first 5 rows of the scaled polynomial features training data, for the first 20 features
print("First 5 rows of the scaled polynomial features training data (first 20 features):")
display(pd.DataFrame(X_train_poly_s, columns=pf.get_feature_names_out()).iloc[:, :20].head()) # Display first 5 rows and first 20 features
print("First 5 rows of the scaled polynomial features test data (first 20 features):")
display(pd.DataFrame(X_test_poly_s, columns=pf.get_feature_names_out()).iloc[:, :20].head()) # Display first 5 rows and first 20 features
First 5 rows of the scaled polynomial features training data (first 20 features):
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | age sex | age bmi | age bp | age s1 | age s2 | age s3 | age s4 | age s5 | age s6 | sex bmi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.498365 | 1.061370 | 0.219902 | 1.138874 | 0.728473 | 1.055893 | -0.824451 | 0.711038 | 0.547482 | -0.061449 | 1.478685 | 0.201656 | 1.600119 | 0.775601 | 1.313181 | -1.356616 | 0.961206 | 0.615275 | -0.340763 | 0.240746 |
| 1 | -0.228858 | 1.061370 | -0.419366 | -0.710591 | -0.424929 | 0.272425 | -1.529791 | 1.484286 | -0.019757 | 0.367236 | -0.387285 | -0.107493 | -0.196386 | -0.172306 | -0.257651 | 0.381348 | -0.496928 | -0.273975 | -0.390426 | -0.444237 |
| 2 | 0.085182 | -0.942179 | 1.018987 | 1.992473 | -0.309589 | -0.326699 | -0.119111 | -0.062210 | 0.331237 | -0.318660 | -0.284422 | -0.061461 | -0.088190 | -0.288372 | -0.247566 | 0.030562 | -0.195102 | -0.230775 | -0.339045 | -1.030741 |
| 3 | -0.621409 | -0.942179 | -0.784662 | -0.639458 | -1.174640 | -1.215508 | 0.664600 | -0.835458 | -1.069682 | -2.719299 | 0.387976 | 0.249318 | 0.052168 | 0.403784 | 0.487677 | -0.364376 | 0.322221 | 0.351797 | 1.395662 | 0.671619 |
| 4 | -0.542899 | -0.942179 | -1.423930 | -1.706457 | -0.799784 | -1.110167 | 1.291569 | -1.608706 | -0.802859 | -0.918820 | 0.313265 | 0.511137 | 0.596223 | 0.136368 | 0.343330 | -0.658037 | 0.668011 | 0.130336 | 0.179585 | 1.274988 |
First 5 rows of the scaled polynomial features test data (first 20 features):
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | age sex | age bmi | age bp | age s1 | age s2 | age s3 | age s4 | age s5 | age s6 | sex bmi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.948794 | -0.942179 | -0.168225 | -0.354925 | 2.631586 | 2.649166 | 0.429487 | 0.711038 | 0.653601 | -0.147186 | -1.106243 | -0.306928 | -0.698754 | 2.156038 | 2.262002 | 0.483086 | 0.548109 | 0.401560 | -0.417867 | 0.089800 |
| 1 | 1.969426 | -0.942179 | 0.745015 | 0.427541 | -0.511434 | -0.333282 | 0.037631 | -0.835458 | -0.496909 | -0.490134 | -2.077486 | 1.344292 | 0.643548 | -1.235322 | -0.880535 | 0.084308 | -1.919519 | -1.228677 | -1.273545 | -0.772154 |
| 2 | 1.341345 | 1.061370 | -0.122563 | -0.283791 | 2.170225 | 1.042726 | 1.213198 | -0.062210 | 1.743607 | -0.404397 | 1.309051 | -0.296133 | -0.733874 | 2.526941 | 1.139573 | 1.825796 | -0.266444 | 2.187851 | -0.842752 | -0.126209 |
| 3 | 2.047936 | -0.942179 | 1.064649 | 1.613333 | 1.160999 | 0.785959 | -1.608162 | 2.953457 | 2.040014 | 1.224607 | -2.152197 | 2.052371 | 3.377151 | 1.989183 | 1.322527 | -3.631520 | 6.248843 | 4.079195 | 2.519026 | -1.073839 |
| 4 | 0.242203 | 1.061370 | -0.465028 | -0.070392 | 0.814978 | 1.134899 | -0.119111 | 0.711038 | -0.133128 | -0.232923 | 0.121616 | -0.295464 | -0.358428 | -0.047218 | 0.083528 | 0.006653 | 0.016985 | -0.302660 | -0.361021 | -0.493164 |
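A quick way to convince yourself the scaling worked: after `fit_transform` on the training data, every column should have mean ≈ 0 and standard deviation ≈ 1 (the test set, transformed with the training statistics, is only approximately standardized). A small synthetic demo, independent of the diabetes data:

```python
# StandardScaler sanity check on synthetic data: training columns come out
# with mean ~0 and std ~1 (StandardScaler uses the population std, ddof=0).
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

scaler = StandardScaler().fit(X_train_demo)
Z = scaler.transform(X_train_demo)
print(np.allclose(Z.mean(axis=0), 0), np.allclose(Z.std(axis=0), 1))  # True True
```

Fitting the scaler on the training set only, then reusing it for the test set, is what prevents information from the test set leaking into the model.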
Data Visualization: Polynomials¶
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Convert to DataFrame
df_poly = pd.DataFrame(X_train_poly_s, columns=pf.get_feature_names_out())
# Select first 20 features
vars_for_plot_poly = df_poly.columns[:20]
# Create pair plot
pairplot_poly = sns.pairplot(
df_poly,
vars=vars_for_plot_poly,
diag_kind='kde',
height=2.3,
plot_kws={'alpha': 0.6, 's': 20, 'edgecolor': 'k'}
)
# Increase axis label font sizes
for ax in pairplot_poly.axes.flatten():
    ax.set_xlabel(ax.get_xlabel(), fontsize=25)
    ax.set_ylabel(ax.get_ylabel(), fontsize=25)
    ax.tick_params(axis='both', labelsize=25)
# Add overall title
plt.suptitle(
"Pair Plot of First 20 Polynomial Features (Degree 2)",
y=1.02,
fontsize=40
)
plt.show()
# Plotting the correlation heatmap for all 55 polynomial features (without annotating the values)
import matplotlib.pyplot as plt
import seaborn as sns
# Correlation heatmap without annotations but showing all 55 features names on the axes
plt.figure(figsize=(30, 25))
sns.heatmap(df_poly.corr(), annot=False, cmap="coolwarm", square=True, cbar_kws={"shrink": .8})
plt.title("Correlation Heatmap of All Polynomial Features (Degree 2)", fontsize=20)
plt.show()
# Plotting Histograms for All Polynomial Features (Degree 2)
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram for all polynomial features, annotating each title with its skewness
plt.figure(figsize=(20, 15))
for i, col in enumerate(df_poly.columns):
    plt.subplot(7, 8, i + 1)
    sns.histplot(df_poly[col], bins=30, kde=True)
    plt.title(f"{col}\nSkewness = {df_poly[col].skew():.2f}", fontsize=10)
    plt.xlabel('')
    plt.ylabel('')
plt.tight_layout()
plt.suptitle("Histograms for All Polynomial Features (Degree 2)", y=1.02, fontsize=20)
plt.show()
Step 12: Linear Regression Baseline (Polynomial Features)¶
Create an object named lr_poly to fit a linear regression model with the expanded feature space. Report the R² score on the test set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lr_poly = LinearRegression().fit(X_train_poly_s, y_train)
r2_lr_poly = r2_score(y_test, lr_poly.predict(X_test_poly_s))
print(f"R² (Linear Regression with Polynomial Features): {r2_lr_poly:.3f}")
R² (Linear Regression with Polynomial Features): 0.478
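With 55 features and no regularization, the gap between training and test R² is a direct view of the model's variance. A self-contained sketch using a pipeline (own split, `random_state=42` assumed, so the numbers will not match the notebook's exactly):

```python
# Train vs. test R² for unregularized linear regression on the
# interaction-expanded feature space: the train score should exceed
# the test score, reflecting increased variance from the extra features.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
# Assumption: random_state=42; the notebook's split is not shown in this section
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_tr, y_tr)
r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))
print(f"train R² = {r2_train:.3f}, test R² = {r2_test:.3f}")
```

This train/test gap is exactly what the Ridge and Lasso fits in the next steps aim to shrink.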
Step 13: Ridge Regression with Cross-Validation¶
Create an object named ridge_poly to fit a 5-fold cross-validated Ridge regression model using alphas = [0.01, 0.1, 1, 10, 100] with the expanded feature space. Report the R² score on the test set, and the best alpha value.
from sklearn.linear_model import RidgeCV
ridge_poly = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
ridge_poly.fit(X_train_poly_s, y_train)
r2_ridge_poly = r2_score(y_test, ridge_poly.predict(X_test_poly_s))
print(f"R² (Ridge + Polynomial): {r2_ridge_poly:.3f}, Best alpha: {ridge_poly.alpha_}")
# Plotting coefficient paths for Ridge regression with CV on polynomial features
#Ignoring warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")
alphas=[0.01, 0.1, 1, 10, 100]
coefs = []
for a in alphas:
    ridge = RidgeCV(alphas=[a], cv=5)
    ridge.fit(X_train_poly_s, y_train)
    coefs.append(ridge.coef_)
coefs = np.array(coefs)
plt.figure(figsize=(8, 6))
# Plot each feature's coefficient path with a matching legend entry
for i, feature in enumerate(pf.get_feature_names_out()):
    plt.plot(alphas, coefs[:, i], label=feature)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('RidgeCV Coefficients as a Function of Alpha (Polynomial Features)')
plt.axis('tight')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', ncol=3)
plt.show()
R² (Ridge + Polynomial): 0.500, Best alpha: 100.0
Step 14: Lasso Regression with Cross-Validation¶
Create an object named lasso_poly to fit a 5-fold cross-validated Lasso regression model using alphas = [0.001, 0.01, 0.05, 0.1, 0.5, 1] and max_iter=500000. Report the R² score on the test set, the best alpha value, and the number of non-zero coefficients. Note that, compared to LassoCV with the original feature space, we have increased the max_iter argument to ensure the algorithm converges; you can try a smaller value to see the warning message.
from sklearn.linear_model import LassoCV
lasso_poly = LassoCV(alphas=[0.001, 0.01, 0.05, 0.1, 0.5, 1], cv=5, max_iter=500000)
lasso_poly.fit(X_train_poly_s, y_train)
r2_lasso_poly = r2_score(y_test, lasso_poly.predict(X_test_poly_s))
print(f"R² (Lasso + Polynomial): {r2_lasso_poly:.3f}, Best alpha: {lasso_poly.alpha_}")
print(f"Non-zero coefficients: {(lasso_poly.coef_ != 0).sum()} out of {len(lasso_poly.coef_)}")
# Plotting coefficient paths for Lasso regression with CV on polynomial features
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LassoCV
import warnings
warnings.filterwarnings("ignore")  # cleaner output
alphas = [0.001, 0.01, 0.05, 0.1, 0.5, 1]
coefs = np.array([LassoCV(alphas=[a], cv=5, max_iter=500000)
                  .fit(X_train_poly_s, y_train).coef_ for a in alphas])
plt.figure(figsize=(8, 6))
plt.plot(alphas, coefs)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('LassoCV Coefficients as a Function of Alpha (Polynomial Features)')
plt.show()
R² (Lasso + Polynomial): 0.514, Best alpha: 1.0
Non-zero coefficients: 32 out of 55
Step 15: Lasso Regression with a Fixed Penalty¶
Similar to Step 7, we can improve model interpretability by increasing alpha without a significant drop in predictive performance. Write code to fit a Lasso regression model with a larger, fixed value of alpha = 5 (without cross-validation). Name the object lasso_poly_c, and report the R² score on the test set and the number of non-zero coefficients.
from sklearn.linear_model import Lasso
lasso_poly_c = Lasso(alpha=5, max_iter=500000)
lasso_poly_c.fit(X_train_poly_s, y_train)
y_pred=lasso_poly_c.predict(X_test_poly_s)
r2_lasso_poly_c = r2_score(y_test, y_pred)
print(f"R² (Lasso + Polynomial): {r2_lasso_poly_c:.3f}")
print(f"Non-zero coefficients: {(lasso_poly_c.coef_ != 0).sum()} out of {len(lasso_poly_c.coef_)}")
# showing the coefficient path for Lasso with a fixed alpha on polynomial features
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso
#Ignoring warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")
alphas = np.logspace(-2, 2, 100)
coefs = []
for a in alphas:
lasso = Lasso(alpha=a, max_iter=500000)
lasso.fit(X_train_poly_s, y_train)
coefs.append(lasso.coef_)
coefs = np.array(coefs)
plt.figure(figsize=(8, 6))
plt.plot(alphas, coefs)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as a Function of Alpha (Polynomial Features)')
# adding the feature names and the alpha=5 line to the legend
for i, feature in enumerate(feature_names):
plt.plot([], [], label=feature)
plt.axis('tight')
# vertical line at alpha=5, with a rotated 'alpha=5' label near the bottom of the line
plt.axvline(x=5, color='red', linestyle='--', label='Alpha = 5')
plt.text(5.5, plt.ylim()[0], ' alpha=5', color='red', rotation=90, verticalalignment='bottom')
# placing the legend outside the plot in three columns
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', ncol=3)
plt.show()
R² (Lasso + Polynomial): 0.490
Non-zero coefficients: 9 out of 55
Step 16: Identifying the Important Features¶
Again we want to create a plot to help us visualize which predictors remain active in the model and whether they have a positive or negative association with the target. Create a horizontal bar chart showing the non-zero Lasso coefficients for alpha = 5. Follow the instructions in Step 8 for guidance.
Hint: the top 5 features should be identical to the plot from Step 8.
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame of feature names and their coefficients
# Get feature names after polynomial expansion
feature_names = pf.get_feature_names_out()
coef_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': lasso_poly_c.coef_
})
# display the non-zero coefficients, sorted from largest to smallest
print("Non-zero Lasso Coefficients (alpha = 5) for Polynomial Features sorted from largest to smallest:")
nonzero_coef_df = coef_df[coef_df['Coefficient'] != 0]
display(nonzero_coef_df.sort_values(by='Coefficient', ascending=False))
# Sort the non-zero coefficients by absolute value for clearer visualization
nonzero_df = nonzero_coef_df.reindex(nonzero_coef_df['Coefficient'].abs().sort_values(ascending=True).index)
# Plot
plt.figure(figsize=(8,5))
plt.barh(nonzero_df['Feature'], nonzero_df['Coefficient'], color='green')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.title('Non-zero Lasso Coefficients (alpha = 5)')
plt.axvline(0, color='black', linewidth=1)
plt.tight_layout()
plt.show()
# showing the intercept value
print(f"Lasso Intercept (alpha = 5) for Polynomial Features: {lasso_poly_c.intercept_}")
Non-zero Lasso Coefficients (alpha = 5) for Polynomial Features sorted from largest to smallest:
| | Feature | Coefficient |
|---|---|---|
| 2 | bmi | 25.690161 |
| 8 | s5 | 18.941873 |
| 3 | bp | 11.581286 |
| 27 | bmi bp | 2.471126 |
| 10 | age sex | 2.074981 |
| 33 | bmi s6 | 1.135663 |
| 12 | age bp | 0.500690 |
| 1 | sex | -2.519322 |
| 6 | s3 | -7.954854 |
Lasso Intercept (alpha = 5) for Polynomial Features: 153.73654390934846
We have reduced the number of features using the Lasso model. Note that Lasso is used here not for its predictive performance, but for feature reduction. Now, let's fit a simple linear regression model using the top 9 features.¶
# Performing linear regression using only the top 9 features identified by Lasso with alpha=5
from scipy import stats
top_features = nonzero_coef_df['Feature'].tolist()[:9]
X_train_top = X_train_poly_s[:, [list(feature_names).index(f) for f in top_features]]
X_test_top = X_test_poly_s[:, [list(feature_names).index(f) for f in top_features]]
lr_top = LinearRegression().fit(X_train_top, y_train)
r2_lr_top = r2_score(y_test, lr_top.predict(X_test_top))
print(f"R² (Linear Regression with Top 9 Lasso Features): {r2_lr_top:.3f}")
# printing feature names of the top 9 features used with their coefficients sorted from largest to smallest
print("==================================================")
print("Top 9 features used in Linear Regression with their coefficients:")
for feature, coef in sorted(zip(top_features, lr_top.coef_), key=lambda x: x[1], reverse=True):
print(f"{feature}: {coef:.3f}")
R² (Linear Regression with Top 9 Lasso Features): 0.535
==================================================
Top 9 features used in Linear Regression with their coefficients:
bmi: 26.000
s5: 20.701
bp: 16.173
age sex: 6.811
bmi bp: 5.597
bmi s6: 3.945
age bp: 3.387
sex: -10.860
s3: -13.894
It can be observed that the model’s fit has improved compared to the initial linear regression performed on the original features, which achieved an $R^2 = 0.453$.
After expanding the feature space using polynomial terms, applying Lasso regularization for feature selection, and then fitting a linear regression model on the reduced set, the performance increased to $R^2 = 0.535$.
This indicates that introducing nonlinear interactions and removing irrelevant features helped the model capture more variance in the target variable while maintaining simplicity.
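As an aside, the "expand features → let Lasso select → refit plain linear regression" workflow above can also be expressed as a single scikit-learn pipeline with SelectFromModel. The sketch below is a minimal illustration on synthetic data (the `X`, `y`, and alpha here are stand-ins, not the notebook's actual variables):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the notebook's data (300 samples, 10 features)
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=20.0, random_state=0)

# Expand features, scale, let Lasso keep the non-zero coefficients, refit OLS
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # 10 -> 65 features
    StandardScaler(),
    SelectFromModel(Lasso(alpha=5, max_iter=500000)),  # keeps non-zero coefs
    LinearRegression(),
)
pipe.fit(X, y)
n_kept = pipe.named_steps['selectfrommodel'].get_support().sum()
print(f"Features kept by Lasso: {n_kept} out of 65")
```

For an L1-penalized selector, SelectFromModel's default threshold keeps exactly the non-zero coefficients, so this mirrors the manual index lookup done in the notebook.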
📈 Model Evaluation through Diagnostic Plots¶
After building a regression model, it is essential to evaluate its performance both numerically and visually.
Visual diagnostics help reveal issues such as bias, nonlinearity, heteroscedasticity, and non-normal residuals that numeric metrics alone might miss.
Below are the most important plots and their mathematical foundations:
1️⃣ Predicted vs. Actual Plot¶
This plot compares the true values (y_test) with the predicted values (y_pred).
$$ \text{Prediction: } \hat{y}_i = f(\mathbf{x}_i) $$
A perfect model would satisfy:
$$ \hat{y}_i = y_i $$
Points lying on the 45° diagonal line ($y = \hat{y}$) represent perfect predictions,
while deviations from the line indicate underprediction or overprediction.
2️⃣ Residual Plot¶
The residual for each observation is defined as:
$$ e_i = y_i - \hat{y}_i $$
Residuals are plotted against fitted (predicted) values.
A good regression model should show residuals randomly scattered around zero:
$$ \mathbb{E}[e_i] = 0, \quad \text{Var}(e_i) = \sigma^2 $$
Patterns or systematic structures in the plot (e.g., curvature or funnels) suggest that the model’s assumptions are violated —
for example, missing nonlinear trends or heteroscedasticity (non-constant variance).
Here is an example of a bad regression model:
Image taken from here:

This plot of residuals versus fits shows that the residual variance (vertical spread) increases as the fitted values (predicted sale prices) increase. This violates the assumption of constant error variance [Ref]. In such cases, we need to refine our model or variables to improve it.
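A rough numeric counterpart to the visual "funnel" check: split the residuals by the median fitted value and compare the spread in each half. The sketch below fabricates heteroscedastic residuals (the data is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic fitted values and residuals whose noise grows with the fit
fitted = np.linspace(1.0, 10.0, 500)
resid = rng.normal(0.0, 0.3 * fitted)   # std proportional to fitted value

# Compare residual spread in the lower vs upper half of fitted values
lo = fitted < np.median(fitted)
hi = ~lo
ratio = resid[hi].std() / resid[lo].std()
print(f"high/low residual std ratio: {ratio:.2f}")  # ratio >> 1 suggests heteroscedasticity
```

For homoscedastic residuals this ratio stays near 1; a ratio well above 1, as here, reproduces numerically the widening funnel seen in the plot.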
3️⃣ Histogram of Residuals¶
The histogram (often with a KDE curve) shows how the residuals are distributed.
For a well-behaved regression model, residuals should approximately follow a normal (Gaussian) distribution centered at zero:
$$ e_i \sim \mathcal{N}(0, \sigma^2) $$
This means most errors are small and positive/negative errors are equally likely.
A symmetric, bell-shaped histogram indicates that the model’s errors are unbiased and consistent.
However:
- A skewed histogram suggests systematic overprediction or underprediction.
- Heavy tails (leptokurtic shape) indicate the presence of outliers or high-variance regions.
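As a numeric complement to the histogram, skewness and excess kurtosis can be computed directly with scipy.stats (sketch on synthetic well-behaved residuals, not the notebook's actual residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
resid = rng.normal(0.0, 1.0, size=1000)  # well-behaved residuals for illustration

skew = stats.skew(resid)      # ~0 for a symmetric distribution
kurt = stats.kurtosis(resid)  # excess kurtosis: ~0 for Gaussian, >0 means heavy tails
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```

Values far from zero flag the same problems the histogram shows visually: skewness indicates systematic over/underprediction, positive excess kurtosis indicates heavy tails or outliers.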
4️⃣ Q–Q Plot (Quantile–Quantile Plot)¶
To assess whether residuals follow a normal distribution, we standardize them:
$$ e_i^{*} = \frac{e_i - \bar{e}}{s_e} $$
where $ \bar{e} $ is the mean of residuals and $ s_e $ is their standard deviation.
We then plot the empirical quantiles of $ e_i^{*} $ against the theoretical quantiles of a standard normal distribution:
$$ q_i = \Phi^{-1}\left(\frac{i - 0.5}{n}\right) $$
where $ \Phi^{-1} $ is the inverse CDF (percent-point function) of the standard normal distribution.
If the residuals are normally distributed, the points should align closely with the diagonal reference line.
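The quantile formula above can be implemented directly in a few lines (a sketch on synthetic residuals; the later plots use `scipy.stats.probplot`, which automates this, though with a slightly different plotting-position formula):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.normal(0.0, 2.0, size=200)  # synthetic residuals for illustration

# Standardize residuals: e* = (e - mean) / std
e_std = (resid - resid.mean()) / resid.std(ddof=1)

# Theoretical quantiles q_i = Phi^{-1}((i - 0.5) / n) vs sorted sample quantiles
n = len(e_std)
q_theory = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
q_sample = np.sort(e_std)

# For normally distributed residuals the pairs hug the 45-degree line
corr = np.corrcoef(q_theory, q_sample)[0, 1]
print(f"Q-Q correlation: {corr:.4f}")
```

A correlation very close to 1 corresponds to points lying on the diagonal reference line; marked departures, especially in the tails, indicate non-normal residuals.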
✅ Why These Plots Matter¶
- The Predicted vs. Actual plot checks overall accuracy and systematic bias.
- The Residual plot verifies that errors are independent and randomly distributed.
- The Histogram of Residuals confirms that residuals are approximately normal and unbiased.
- The Q–Q plot provides a formal visual test for the normality assumption.
Together, these figures provide a comprehensive diagnostic check, confirming whether the model is well-specified, unbiased, and generalizable.
👇 Now let's explore our model using the abovementioned figures.¶
# Plot Predicted vs Actual and the residual plot for Linear Regression with Top 9 Lasso Features
plt.figure(figsize=(10, 8))
# Predicted vs Actual
plt.subplot(2, 2, 1)
sns.scatterplot(x=y_test, y=lr_top.predict(X_test_top), color='purple', alpha=0.6, edgecolor='k')
# Perfect fit line (y = x)
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
color='red', linestyle='--', linewidth=2, label='Ideal fit (y_pred = y_test)')
# Labels and title
plt.xlabel('Actual Values (y_test)', fontsize=10)
plt.ylabel('Predicted Values (y_pred)', fontsize=10)
plt.title('Linear Regression with Top 9 Features Obtained from Lasso\nPredicted vs. Actual Values', fontsize=10)
# Move R² box slightly lower and right
plt.text(y_test.min() + (y_test.max() - y_test.min())*0.001,
y_test.max()*0.85,
f"R² = {r2_lr_top:.3f}",
fontsize=12, color='black', bbox=dict(facecolor='white', alpha=0.8))
# Move legend to top-right to avoid overlap
plt.legend(loc='upper left')
plt.tight_layout()
# Residuals vs. Predicted
plt.subplot(2, 2, 2)
sns.scatterplot(x=lr_top.predict(X_test_top), y=(y_test - lr_top.predict(X_test_top)), color='purple', alpha=0.6, edgecolor='k')
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values (y_pred)')
plt.ylabel('Residuals (y_test - y_pred)')
plt.title('Residuals vs. Predicted Values for\n Linear Regression with Top 9 Lasso Features', fontsize=10)
plt.tight_layout()
# showing the equation of the linear regression model with top 9 features on the plot
equation_text_top = "y_pred = "
for feature, coef in zip(top_features, lr_top.coef_):
equation_text_top += f"{coef:.3f}*{feature} + "
equation_text_top = equation_text_top[:-3] # Remove last " + "
equation_text_top += f" + {lr_top.intercept_:.3f}"
plt.gcf().text(0.1, -0.1, "The equation of the multiple linear model is:\n " + equation_text_top, fontsize=12, wrap=True, color='Red')
# Plotting Standardized QQ plot and histogram of residuals for Linear Regression with Top 9 Lasso Features in a subplot
# Standardized QQ plot with purple color
plt.subplot(2, 2, 3)
stats.probplot(y_test - lr_top.predict(X_test_top), dist="norm", plot=plt)
plt.title('Standardized QQ Plot of Residuals\nLinear Regression with Top 9 Lasso Features', fontsize=10)
# Histogram of residuals
plt.subplot(2, 2, 4)
sns.histplot(y_test - lr_top.predict(X_test_top), bins=20, kde=True, color='purple')
plt.title('Histogram of Residuals\nLinear Regression with Top 9 Lasso Features', fontsize=10)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
📊 Interpretation of Model Diagnostics¶
The diagnostic plots above illustrate how well the linear regression model (trained on the top 9 Lasso-selected features) performs on the Diabetes dataset:
Predicted vs. Actual Plot:
The points generally follow the red dashed line, indicating a moderate positive correlation between the predicted and actual values. In this example, the R² value shown on the plot quantifies how well the model explains the variance in the target variable:
- $R^2 = 1.0$ → perfect prediction
- $R^2 = 0.0$ → no predictive power (random guess)

The coefficient of determination ($R^2 = 0.535$) suggests that the model explains about 53% of the variance in the target variable.
Residuals vs. Predicted Plot:
The residuals appear randomly scattered around zero, showing no strong pattern.
This indicates that the model captures most of the systematic variation, though the slightly wider spread at higher predicted values hints at mild heteroscedasticity.
Q–Q Plot of Residuals:
Most points lie close to the diagonal reference line, implying that the residuals are approximately normally distributed.
Minor deviations at the tails suggest a few outliers but no severe violation of normality.
Histogram of Residuals:
The residuals form a roughly symmetric, bell-shaped distribution centered near zero, supporting the assumption of unbiased errors.
Overall, the diagnostic plots confirm that the linear regression model fits the data reasonably well, with normally distributed and mostly independent residuals, though there remains room for improvement in explaining variance.
Section 3. Conclusion¶
Given everything you have explored in this exercise (and what we discussed in the course) about regularization, interaction terms, overfitting, and model generalizability, what is your perspective on the bias–variance trade-off and the interpretability trade-off for this dataset?
Take a moment to reflect and collect your thoughts before revealing my answer below.
Conclusion (my version)
Across both experiments, regularization consistently improves model performance, confirming its role in reducing variance and enhancing generalization.
When we extend the feature space to include interaction terms, we observe a modest additional improvement in test-set R², indicating that these nonlinear interactions capture a small portion of residual structure that the purely linear model could not explain.
However, as seen from both the performance metrics and the most influential predictors identified by Lasso (α = 5), the qualitative story remains the same: the model’s insights and key drivers of diabetes progression are largely unchanged.
This suggests that, for this dataset, the relationships between predictors and the target are predominantly linear and additive, and the marginal gains from including polynomial interactions may not justify the added complexity or overfitting risk for future data.
In practice, this highlights an important modeling principle:
Regularization helps control variance, while model expansion helps reduce bias; but the goal is not maximum complexity, it’s meaningful simplicity.
After finalizing our model, we can use the residual plots to obtain an overall view of our model's performance.
The notebook of this assignment can be found at my: GitHub Repository